
GH-35730: [C++] Add the ability to specify custom schema on a dataset write #35860

Conversation

@westonpace (Member) commented May 31, 2023

Rationale for this change

The dataset write node previously allowed you to specify custom key/value metadata on a write node. This was added to support saving schema metadata. However, it doesn't capture field metadata or field nullability. This PR replaces that capability with the ability to specify a custom schema instead. The custom schema must have the same number of fields as the input to the write node and each field must have the same type.
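
For illustration, a minimal Python sketch of how this capability surfaces through pyarrow's write_dataset (the table, path, and schema here are hypothetical; only the schema= argument reflects what this PR adds):

    # Minimal sketch; the table and path are hypothetical.
    import pyarrow as pa
    import pyarrow.dataset as ds

    # A custom schema: same field count and types as the data, but carrying
    # nullability and field metadata -- what custom_metadata could not express.
    custom_schema = pa.schema(
        [pa.field("x", pa.int64(), nullable=False, metadata={"unit": "meters"})])
    table = pa.table({"x": [1, 2, 3]}, schema=custom_schema)
    ds.write_dataset(table, "/tmp/example", schema=custom_schema, format="parquet")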

What changes are included in this PR?

Added custom_schema to WriteNodeOptions and removed custom_metadata.

Are these changes tested?

Yes, I added a new C++ unit test to verify that the custom info is applied to written files.

Are there any user-facing changes?

No breaking changes; only new functionality (which is user facing).

@github-actions (bot)

⚠️ GitHub issue #35730 has been automatically assigned in GitHub to PR creator.

@westonpace (Member, Author)

Starting in draft stage, as I believe we should get some Python unit tests in here; @anjakefala volunteered to write some.

@anjakefala (Collaborator) commented May 31, 2023

So far, I added 2 basic tests, based on my understanding of the feature!

The basic case where you write a single table, which contains a field with nullability specified, passes.

Note that this one:

    # we can specify the nullability of a field through the schema
    pa.dataset.write_dataset(table_no_null, tempdir/"nulltest2", schema=schema_nullable)
    dataset = ds.dataset(tempdir/"nulltest2", format="parquet")
    assert dataset.to_table().schema.equals(schema_nullable)

is failing for now. I did not specify nullability in the table's schema, but then specified it in write_dataset(schema=schema_nullable).

Is it expected that the returned dataset would have a field with nullability?

In this example, `table` has a field with nullability specified, while `table_no_null` does not:

pa.dataset.write_dataset([table_no_null, table], tempdir/"nulltest2", schema=schema_nullable) 

The resulting schema also does not have nullability.
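
The thread does not show the test fixtures; for context, a guess at what they plausibly look like (these exact definitions are assumptions):

    # Hypothetical reconstruction of the fixtures above. Despite its name,
    # schema_nullable marks "x" as not null, matching the error shown below.
    import pyarrow as pa

    schema_nullable = pa.schema([pa.field("x", pa.int64(), nullable=False),
                                 pa.field("y", pa.int64())])
    table = pa.table({"x": [1, 2], "y": [3, 4]}, schema=schema_nullable)
    table_no_null = pa.table({"x": [1, 2], "y": [3, 4]})  # both fields nullable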

@anjakefala (Collaborator)

I have confirmed that

 pa.dataset.write_dataset(table, tempdir/"nulltest2", schema=schema_nullable, format="parquet") 

has a field with nullability.

The following do not:

pa.dataset.write_dataset([table_no_null, table], tempdir/"nulltest2", schema=schema_nullable, format="parquet")  

or

pa.dataset.write_dataset([table, table_no_null], tempdir/"nulltest2", schema=schema_nullable, format="parquet") 

We can, however, create an InMemoryDataset with mixed field metadata, so I created a test for that.

@westonpace (Member, Author)

The following do not:

pa.dataset.write_dataset([table_no_null, table], tempdir/"nulltest2", schema=schema_nullable, format="parquet")

or

pa.dataset.write_dataset([table, table_no_null], tempdir/"nulltest2", schema=schema_nullable, format="parquet")

These lines failed for me with the following error:

pyarrow/dataset.py:936: in write_dataset
    data = InMemoryDataset(data, schema=schema)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   raise ArrowTypeError(
E   pyarrow.lib.ArrowTypeError: Item has schema
E   x: int64
E   y: int64
E   which does not match expected schema
E   x: int64 not null
E   y: int64

I thought this was supported and it took me a moment to track down what was going on. The error is actually being raised before the C++ call to write the dataset. Pyarrow is taking the two inputs (table, table_no_null) and trying to put them in an InMemoryDataset and specifying the schema. The constructor for InMemoryDataset is verifying that all the tables it has been given have the same schema and throwing an error because it was given a table whose schema does not match the dataset's schema.

If this is the same error you were getting then I think we can call this an invalid scenario and we don't have to support it, at least for this PR. (Arguably, you could evolve a table into the correct schema when adding it to an InMemoryDataset, but that's a different feature.)

This is kind of confusing because @anjakefala and I were testing earlier and found that you are allowed to create an InMemoryDataset from tables/batches that have the same types and nullability but different field metadata. So I created an additional Python test case for field metadata, which does verify the "two tables but mixed metadata can be overridden by an explicit schema" call.
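
A sketch of the distinction, using hypothetical fixtures (the allowed case is the one the new test covers; the rejected case is the invalid scenario above):

    # Hypothetical sketch of what the InMemoryDataset constructor accepts.
    import pyarrow as pa
    import pyarrow.dataset as ds

    plain = pa.schema([pa.field("x", pa.int64())])
    tagged = pa.schema([pa.field("x", pa.int64(), metadata={"doc": "tagged"})])
    t1 = pa.table({"x": [1]}, schema=plain)
    t2 = pa.table({"x": [2]}, schema=tagged)

    # Allowed: same types and nullability, only field metadata differs.
    ds.InMemoryDataset([t1, t2], schema=tagged)

    # Rejected: nullability differs from the dataset schema.
    not_null = pa.schema([pa.field("x", pa.int64(), nullable=False)])
    try:
        ds.InMemoryDataset([t1, t2], schema=not_null)
    except pa.lib.ArrowTypeError:
        pass  # "Item has schema ... which does not match expected schema"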

@westonpace westonpace marked this pull request as ready for review June 1, 2023 05:03
@westonpace westonpace requested a review from AlenkaF as a code owner June 1, 2023 05:03
@westonpace (Member, Author)

Oh, and thank you for adding the test case :)

@anjakefala (Collaborator)

If this is the same error you were getting then I think we can call this an invalid scenario and we don't have to support it (at least for this PR. Arguably, you could evolve a table into the correct schema if adding it to an InMemoryDataset but that's a different feature).

Good to know!

@jorisvandenbossche (Member) left a comment

Looks good to me!

I just have one question about whether we should also validate the names of the schema; the rest are minor code style nitpicks in the test code, for which I can also push some changes.

  }
  for (int field_idx = 0; field_idx < input_schema->num_fields(); field_idx++) {
    if (!input_schema->field(field_idx)->type()->Equals(
            custom_schema->field(field_idx)->type())) {
@jorisvandenbossche (Member):

Should we also test that the names of the fields are equal?

@westonpace (Member, Author):

Changing the names should be safe. Admittedly, a user could also do this name change by inserting a project node before the write node.

I could be convinced otherwise, but I don't think this does any harm, and as a user I would expect this behavior, so the name change wouldn't be surprising.

@github-actions bot added the awaiting merge label and removed the awaiting committer review label Jun 1, 2023

if (custom_schema != nullptr) {
  if (custom_schema->num_fields() != input_schema->num_fields()) {
    return Status::Invalid(
Member:

Status::TypeError here?

@westonpace (Member, Author):

Switched. Although I will confess I'm not entirely clear on when to use one over the other. In my mind TypeError is only for cases where a dynamic_cast fails.

Member:

Yes, to be honest, for a wrong number of fields I would also raise a ValueError/Invalid (it's a wrong value, not a wrong type).

Member:

A schema error is a type error IMHO.
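
For context on how the two statuses differ for Python users (this is pyarrow's standard exception mapping, not part of this diff):

    # Status::Invalid surfaces as ArrowInvalid (a ValueError subclass);
    # Status::TypeError surfaces as ArrowTypeError (a TypeError subclass).
    import pyarrow as pa

    assert issubclass(pa.ArrowInvalid, ValueError)
    assert issubclass(pa.ArrowTypeError, TypeError)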

for (int field_idx = 0; field_idx < input_schema->num_fields(); field_idx++) {
  if (!input_schema->field(field_idx)->type()->Equals(
          custom_schema->field(field_idx)->type())) {
    return Status::Invalid("The provided custom_schema specified type ",
Member:

Same here.

@westonpace (Member, Author):

Switched.


/// \brief Options to control how to write the dataset
FileSystemDatasetWriteOptions write_options;
/// \brief Optional metadata to attach to written batches
std::shared_ptr<const KeyValueMetadata> custom_metadata;
Member:

Instead of removing this option (as a breaking change), we could in theory still allow the user to specify one or both?

(I am not using the C++ API for this, so I don't know how useful this would be / how cumbersome it is to specify the schema if you only want to specify metadata. From the DatasetWriter point of view, this is a fine change of course since there we already have the full schema)

@westonpace (Member, Author):

If it were a new feature I would argue it's not worth it (a user could technically use DeclarationToSchema to get the output schema of the plan leading up to the write and then attach custom metadata to that). However, given we have already released custom_metadata, and I would like Acero's API to start being stable, I suppose I should set an example. Thanks for the nudge. I have restored custom_metadata.
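
From the Python side, a hedged sketch of why custom_schema alone can cover the metadata use case (names and path hypothetical):

    # Key/value metadata can ride along on the custom schema itself.
    import pyarrow as pa
    import pyarrow.dataset as ds

    table = pa.table({"x": [1, 2, 3]})
    schema_with_md = table.schema.with_metadata({"origin": "sensor-42"})
    ds.write_dataset(table, "/tmp/with_md", schema=schema_with_md, format="parquet")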

@github-actions bot added the awaiting changes label and removed the awaiting merge label Jun 1, 2023
@github-actions bot added the awaiting change review and awaiting changes labels and removed the awaiting changes and awaiting change review labels Jun 1, 2023
@github-actions bot added the awaiting change review label and removed the awaiting changes label Jun 1, 2023
@westonpace (Member, Author)

The failing test on AMD64 is unrelated (flight ucx test). There is still one R test running but I think this is good to go now.

@@ -34,6 +34,7 @@
 #include "arrow/filesystem/filesystem.h"
 #include "arrow/io/file.h"
 #include "arrow/util/compression.h"
+#include "arrow/util/key_value_metadata.h"
Member:

This shouldn't be necessary if arrow/type_fwd.h is included.

@westonpace (Member, Author):

I removed this include and explicitly included arrow/type_fwd.h.

@pitrou (Member) commented Jun 1, 2023

@jorisvandenbossche Do you want to take a look at the Python test changes?

@github-actions bot added the awaiting changes label and removed the awaiting change review label Jun 1, 2023
@anjakefala (Collaborator) left a comment

I like the Python test changes!

Weston took out the invalid test case, and added one that covers how the field metadata gets applied.

@jorisvandenbossche (Member)

Do you want to take a look at the Python test changes?

There were no changes to this since my previous review (and push), so all good!

@westonpace (Member, Author)

I'm going to go ahead and merge if green. I think it's quite late for Joris and I believe Raul will be starting the patch build soon. If Joris finds an issue later then we can still scrap the patch release candidate for a fix (though hopefully we don't need to).

@github-actions bot added the awaiting merge label and removed the awaiting changes label Jun 1, 2023
@westonpace westonpace merged commit 018e7d3 into apache:main Jun 1, 2023
@westonpace westonpace added this to the 12.0.1 milestone Jun 1, 2023
raulcd pushed a commit that referenced this pull request Jun 1, 2023
… write (#35860)


* Closes: #35730

Lead-authored-by: Weston Pace <[email protected]>
Co-authored-by: Nic Crane <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
Co-authored-by: anjakefala <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Weston Pace <[email protected]>
@raulcd (Member) commented Jun 2, 2023

@github-actions crossbow submit test-r-ubuntu-22.04

@github-actions (bot) commented Jun 2, 2023

Revision: 875dfed

Submitted crossbow builds: ursacomputing/crossbow @ actions-42981f190e

Task status: test-r-ubuntu-22.04 on GitHub Actions

@ursabot commented Jun 3, 2023

Benchmark runs are scheduled for baseline = 3fe4a31 and contender = 018e7d3. 018e7d3 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.8% ⬆️0.0%] test-mac-arm
[Finished ⬇️0.33% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.6% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 018e7d3f ec2-t3-xlarge-us-east-2
[Finished] 018e7d3f test-mac-arm
[Finished] 018e7d3f ursa-i9-9960x
[Finished] 018e7d3f ursa-thinkcentre-m75q
[Finished] 3fe4a315 ec2-t3-xlarge-us-east-2
[Finished] 3fe4a315 test-mac-arm
[Finished] 3fe4a315 ursa-i9-9960x
[Finished] 3fe4a315 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Jun 4, 2023
…ataset write (apache#35860)


Successfully merging this pull request may close these issues.

[Python] write_dataset does not preserve non-nullable columns in schema